arXiv:1505.07254v1 [cs.DM] 27 May 2015


Differentially Private Response Mechanisms on 
Categorical Data 


Naoise Holohan^a, Douglas J. Leith^a, Oliver Mason^{b,*}

^a School of Computer Science and Statistics, Trinity College Dublin, Ireland
^b Dept. of Mathematics and Statistics/Hamilton Institute, Maynooth University-National
University of Ireland Maynooth, Co. Kildare, Ireland


Abstract 

We study mechanisms for differential privacy on finite datasets. By deriving
sufficient sets for differential privacy we obtain necessary and sufficient
conditions for differential privacy, a tight lower bound on the maximal expected
error of a discrete mechanism and a characterisation of the optimal mechanism
which minimises the maximal expected error within the class of mechanisms
considered.

1. Introduction

Data privacy has been of interest to researchers for decades [5], but high- 
profile privacy breaches in recent years, such as those involving AOL [1] and 
Netflix [16], have renewed focus on the topic. The movement towards smart 
metering systems for electricity, water and other utilities and the greater use of 
data mining in so-called smart cities and transport have given rise to further 
concerns over personal data privacy. 

The most traditional framework for the study of data privacy is that of 
tabular data. A simple model of this type considers the data to be arranged 
as individual records within a table, where each record contains entries from 
some underlying dataset, which may be continuous or discrete depending on the 
type of data being studied. Simple anonymisation techniques, such as removing 
names and social security numbers (so-called unique identifiers) from the data, 
have been shown to be inadequate [17]. More sophisticated frameworks such as 
$k$-anonymity [17] and $\ell$-diversity [13] are also vulnerable to privacy attacks via
the use of appropriate side-information or data from external sources [13, 12]. 


* Corresponding author. Tel.: +353 (0)1 7083672; fax: +353 5(0)1 7083913; email:
oliver.mason@nuim.ie


Preprint submitted to Elsevier 


May 28, 2015 





Within the last decade, differential privacy [7] has emerged as a popular 
framework for research in the field of data privacy based on its capability to 
provide a quantifiable basis for privacy preserving data publishing and mining. 
This is a probabilistic approach to data privacy in which a suitably randomised 
version of the correct response to a query is released. The core idea is founded 
on the simple premise that the response to a user query should not be too tightly 
coupled with any one entry in the table. One widely-adopted implementation 
of differential privacy for real-valued databases is to add an appropriate amount 
of noise sampled from a Laplace distribution to each cell of the database [6]. 

Much research on differential privacy to date has been completed on real-valued
databases [6], although a considerable body of literature also exists on
discrete data [14, 3]; in particular some recent work has focussed on graph data 
relevant to applications in areas such as social networks [11, 2]. 

Differentially private mechanisms can be divided into two distinct classes: 
sanitisation based mechanisms; and output perturbation based mechanisms. 
Our concern here is with the former class, which first constructs a sanitised 
version of the database and then answers queries on this. It has been shown 
in [10] that if the sanitised database satisfies the requirements of differential 
privacy, then any query can be answered on it in a differentially private manner. 

In writing this paper, we have two aims: the first is to present a set of new 
results on the mathematical foundations of differential privacy for discrete data; 
the second is to bring the problems in this field to the attention of researchers 
in discrete applied mathematics. 

We examine differentially private mechanisms for discrete data within the 
general probabilistic framework described in our previous paper [10]. As we deal 
with finite datasets here, many of the measure-theoretic details required for the 
more general setting can be suppressed. However, to properly set context, we 
include the more general definitions here in Section 2. 

Our first results concern an adaptation for discrete data of the exponential 
mechanism introduced by McSherry and Talwar. In particular, we consider the 
problem of sufficient sets for differential privacy for this mechanism. This problem
is motivated by the practical issue of testing whether or not a mechanism
is differentially private and arises from the following simple considerations. 

For a sanitisation to be differentially private, certain inequalities (described 
formally later) must hold on all subsets of the database space, which can necessitate
checking a prohibitively large collection of sets in order to test for
differential privacy. The question of sufficient sets asks whether it is sufficient 
for the differential privacy condition to hold on a collection of these subsets for 
it to hold on all subsets. We can therefore reduce the workload required to check 
that a mechanism satisfies differential privacy. In Section 3, we present results 
characterising sufficient sets for the discrete exponential mechanism. We then 
use these to give necessary and sufficient conditions for differential privacy for 
this mechanism. 

A major concern of privacy research is the trade-off between privacy and
accuracy. For the current setting, in the absence of a given metric on the
dataset, we measure the error of a sanitisation using the Hamming distance; in
Theorem 5 we derive a tight lower bound on the maximal expected error of a
discrete exponential mechanism.

In Section 4 we consider a seemingly unrelated approach to database sanitisation:
product sanitisations. We show that these are in fact equivalent to the
discrete exponential mechanism constructed using the Hamming distance and,
building on results in [10], we characterise differential privacy and the error
for these in Theorems 8 and 9 respectively. Finally, in Theorem 10 we provide
a characterisation of the optimal product sanitisation mechanism, which
minimises the maximal expected error within the class of product sanitisations
(and hence within the class of discrete exponential mechanisms). Concluding
remarks are given in Section 5.

1.1. Related work 

Before the advent of differential privacy, Fienberg examined the use of data
swapping and cell suppression for privacy protection on categorical data [8].
Dwork then presented the notion of differential privacy in [7], and its
limitations, including its applicability to categorical data, were discussed by
Dankar in [4].

Dwork’s work was closely followed by McSherry and Talwar who proposed 
the exponential mechanism in [14]. An instantiation of this was used by Hardt 
and Talwar [9] in examining the geometry of differential privacy. Mohammed 
made use of the exponential mechanism for releasing count queries in [15, 3], 
while a more recent contribution has looked at differential privacy on counts 
using a combination of the Laplace and exponential mechanisms [18]. 

2. Preliminaries 

2.1. Database model 

We consider a finite data set $D$ with $(m + 1)$ elements ($m \geq 1$). A database $\mathbf{d}$
with $n$ rows drawn from this data set is represented by a vector
$\mathbf{d} = (d_1, \ldots, d_n) \in D^n$. $D$ is equipped with a $\sigma$-algebra, in this case
the power set $2^D$, and $D^n$ inherits the product $\sigma$-algebra, $2^{D^n}$. We are
therefore considering all subsets of $D$ and $D^n$.

We will consider the Hamming distance on $D^n$. Recall that the Hamming
distance, $h : D^n \times D^n \to \{0, 1, \ldots, n\}$, between two databases is the number of
rows on which they differ:

\[ h(\mathbf{d}, \mathbf{d}') = |\{ i : d_i \neq d'_i \}|. \tag{1} \]

We say two databases $\mathbf{d}, \mathbf{d}' \in D^n$ are neighbours, written $\mathbf{d} \sim \mathbf{d}'$, if
$h(\mathbf{d}, \mathbf{d}') = 1$, i.e. they differ on exactly one row.
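
As an illustration of (1) and of the neighbour relation, the following is a minimal sketch in Python. The function names `hamming` and `are_neighbours` are our own and are not part of the paper; databases are assumed to be given as equal-length sequences of rows.

```python
from typing import Sequence


def hamming(d: Sequence, d_prime: Sequence) -> int:
    """Number of rows on which two equal-length databases differ, as in (1)."""
    if len(d) != len(d_prime):
        raise ValueError("databases must have the same number of rows")
    return sum(1 for a, b in zip(d, d_prime) if a != b)


def are_neighbours(d: Sequence, d_prime: Sequence) -> bool:
    """True when the two databases differ on exactly one row."""
    return hamming(d, d_prime) == 1
```

For instance, `are_neighbours((0, 1, 2), (0, 1, 1))` holds because only the last row differs.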





2.2. Query model 

We make use of the generalised query model introduced in [10], adapted to
the discrete setting. A query $Q : D^n \to E_Q$ outputs a response in $E_Q$, the
structure of which is not specified (it may be numeric, categorical, functional,
etc.). $E_Q$ is, however, equipped with a $\sigma$-algebra $\mathcal{A}_Q$. We require that all queries
be measurable, which is trivial in this setting since $Q^{-1}(A) \subseteq D^n$ for all $A \in \mathcal{A}_Q$.

2.3. Response mechanism 

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. We define a response mechanism for a
query $Q$ to be the family of measurable mappings

\[ \{ X_{Q,\mathbf{d}} : \Omega \to E_Q \mid \mathbf{d} \in D^n \}. \tag{2} \]

For simplicity, when the query in question is the identity query $I$, we denote
$X_{I,\mathbf{d}}$ by $X_{\mathbf{d}} : \Omega \to D^n$. Where there is no ambiguity, the response mechanism
will be written as $\{X_{Q,\mathbf{d}}\}$ (or $\{X_{\mathbf{d}}\}$ when dealing with the identity query).

In this paper we deal exclusively with sanitised response mechanisms, where
$X_{Q,\mathbf{d}} = Q \circ X_{\mathbf{d}}$. In this case the mechanism is generated by first sanitising the
database $\mathbf{d}$ and then answering queries on the sanitised database $X_{\mathbf{d}}$ without
any further modification to the data. Hence, a sanitised response mechanism
for a query $Q$ is the family of mappings

\[ \{ X_{Q,\mathbf{d}} = Q \circ X_{\mathbf{d}} : \Omega \to E_Q \mid \mathbf{d} \in D^n \}. \tag{3} \]

One mechanism which we will make use of in this paper is the exponential 
mechanism, as described by McSherry and Talwar [14]. The following is the 
exponential mechanism written in our notation, reformulated to deal specifically 
with discrete data. 

Definition 1 (Exponential mechanism) Given a query $Q$, a query output
space $E_Q$, a utility function (which measures the utility of all possible query
answers to the database being queried) $u : D^n \times E_Q \to \mathbb{R}$, a measure $\mu : E_Q \to \mathbb{R}$
and a normalisation constant $C_{\mathbf{d}}$, the exponential mechanism is defined to be
the family of mappings $\{X_{Q,\mathbf{d}} : \Omega \to E_Q \mid \mathbf{d} \in D^n\}$, with probability density
function with respect to $\mu$ given by $C_{\mathbf{d}}^{-1} e^{\epsilon u(\mathbf{d}, q)}$ for each $\mathbf{d} \in D^n$ and for $q \in E_Q$.

Comment: If $E_Q$ is discrete, we can specify $\mu$ by $\{\mu(q) : q \in E_Q\}$ and the
exponential mechanism has probability mass function

\[ \mathbb{P}(X_{Q,\mathbf{d}} = q) = C_{\mathbf{d}}^{-1} e^{\epsilon u(\mathbf{d}, q)} \mu(q), \tag{4} \]

where $q \in E_Q$.

2.4. Differential privacy

We now define what it means for a response mechanism in our framework 
to be differentially private. 

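Definition 2 (Differential privacy with respect to a query) Let a query
$Q : D^n \to E_Q$ and parameters $\epsilon \geq 0$ and $0 \leq \delta \leq 1$ be given. A response
mechanism $\{X_{Q,\mathbf{d}} : \Omega \to E_Q \mid \mathbf{d} \in D^n\}$ is $(\epsilon, \delta)$-differentially private if

\[ \mathbb{P}(X_{Q,\mathbf{d}} \in A) \leq e^{\epsilon} \, \mathbb{P}(X_{Q,\mathbf{d}'} \in A) + \delta \tag{5} \]

for all $\mathbf{d} \sim \mathbf{d}' \in D^n$ and for all measurable $A \subseteq E_Q$.

Note that the relation $\mathbf{d} \sim \mathbf{d}'$ is symmetric, so (5) must also hold when $\mathbf{d}$ and $\mathbf{d}'$
are swapped.

Note: McSherry and Talwar showed that the exponential mechanism
(Definition 1) satisfies $2\epsilon\Delta u$-differential privacy ($\delta = 0$), where

\[ \Delta u = \max_{\mathbf{d} \sim \mathbf{d}' \in D^n, \; q \in E_Q} |u(\mathbf{d}, q) - u(\mathbf{d}', q)|. \]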

Theorem 4 of [10] simplifies the problem of checking differential privacy for 
sanitised response mechanisms. We now recall that theorem. 

Theorem 1 (Identity query) A sanitised response mechanism which is
$(\epsilon, \delta)$-differentially private with respect to the identity query is
$(\epsilon, \delta)$-differentially private with respect to any query $Q$.

We therefore need only examine the response mechanism for the identity
query,

\[ \{ X_{\mathbf{d}} : \Omega \to D^n \mid \mathbf{d} \in D^n \}. \tag{6} \]

Example 1 (Categorical data I) Suppose the data we are interested in records
individuals' favourite hobby. The data set $D$ would contain a list of all hobbies.
For simplicity in this example, we restrict answers to the following five hobbies:
Sports; Cars; Television; Computer games; and Reading. Hence, $m = 4$. Each
database $\mathbf{d}$ would contain the favourite hobby of $n$ individuals. If $n = 6$, one
possible $\mathbf{d}$ could be represented by the following list: Sports; Computer games;
Television; Sports; Reading; Television.

Queries on such databases could include counting the number of unique hobbies
(4 in the case above) or how many list 'Television' as their favourite hobby
(2 in the case above). The identity query would be another valid query.
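
The queries mentioned in Example 1 can be sketched as follows. This is only an illustration: the function names are hypothetical, and the database is assumed to be stored as a Python list of strings.

```python
from collections import Counter

# The database from Example 1 (n = 6 rows).
d = ["Sports", "Computer games", "Television", "Sports", "Reading", "Television"]


def count_unique(database):
    """Query: number of distinct hobbies present in the database."""
    return len(set(database))


def count_value(database, value):
    """Query: number of rows equal to `value`."""
    return Counter(database)[value]


def identity(database):
    """Identity query: returns a copy of the database itself."""
    return list(database)
```

In the sanitised setting studied here, such queries would be evaluated on the sanitised database rather than on `d` directly.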

3. Sufficient sets for discrete exponential mechanism 

In order for a response mechanism to be deemed $(\epsilon, \delta)$-differentially private,
(5) must hold for all pairs of neighbouring databases and for all possible subsets
of $D^n$. If we were to check all combinations, this would require checking all
$nm(m+1)^n$ pairs of neighbouring databases on $2^{(m+1)^n} - 2$ subsets of $D^n$
(all subsets except $D^n$ itself and $\emptyset$). Therefore, checking a discrete response
mechanism for differential privacy requires $nm(m+1)^n \left(2^{(m+1)^n} - 2\right)$ checks in
total.
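
To give a feel for how quickly this count grows, the following sketch evaluates it directly. The function name `naive_check_count` is our own; pairs of neighbours are counted as ordered pairs, as in the formula above.

```python
def naive_check_count(n: int, m: int) -> int:
    """Number of (ordered neighbour pair, subset) checks without any shortcut."""
    databases = (m + 1) ** n
    pairs = n * m * databases        # each database has n*m neighbours
    subsets = 2 ** databases - 2     # exclude the empty set and D^n itself
    return pairs * subsets
```

With $n = 2$ and $m = 2$ this gives $36 \times 510 = 18{,}360$ checks, the figure used in Example 2 below; with $n = 3$ and $m = 2$ the number of subsets alone already exceeds $10^8$.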

However, for a given pair of neighbouring databases, it is not always necessary
to check (5) on all subsets of $D^n$. So we ask the question: For a given
$\mathbf{d} \sim \mathbf{d}'$, what is the smallest collection of subsets of $D^n$ that we need to check for
(5) to hold on all subsets of $D^n$? We call such a collection of subsets sufficient
sets of the mechanism for $\mathbf{d} \sim \mathbf{d}'$.

In this section, we examine the sufficient sets for a class of discrete response
mechanisms and show that, even in the most general cases, significant improvements
on workload can be made when checking for differential privacy. We also
present conditions that are necessary and sufficient for differential privacy to
hold. This compares to the mostly sufficient conditions presented in other
differential privacy literature, which can therefore give a conservative estimate on
the privacy level achieved.

3.1. General response mechanism 

We begin by considering the exponential mechanism described by McSherry 
and Talwar [14], as detailed in Definition 1. We wish to assign a probability 
to each database based on its utility to the input/reference database. This is 
determined by the utility function, which can be a metric or any other function 
deemed suitable for a particular application. As discussed in Section 2.4, we are 
only concerning ourselves with the identity query. 

Ordinarily, to minimise error, we would want to assign the reference/input
database itself the highest probability of being returned, with decreasing likelihood
the further we move away from the reference, as determined by the utility
function.

Definition 3 (Discrete exponential mechanism) Let $u : D^n \times D^n \to \mathbb{R}$ be
given. The discrete exponential response mechanism is defined to be a family
of measurable mappings

\[ \{ X_{\mathbf{d}} : \Omega \to D^n \mid \mathbf{d} \in D^n \}, \]

where each $X_{\mathbf{d}}$ satisfies

\[ \mathbb{P}(X_{\mathbf{d}} = \mathbf{d}') = C_{\mathbf{d}}^{-1} e^{u(\mathbf{d}, \mathbf{d}')}, \tag{7} \]

for all $\mathbf{d}, \mathbf{d}' \in D^n$. As $D^n$ is finite, we can define the normalisation constant
$C_{\mathbf{d}}$ as

\[ C_{\mathbf{d}} = \sum_{\mathbf{d}' \in D^n} e^{u(\mathbf{d}, \mathbf{d}')} \]

for each $\mathbf{d} \in D^n$.

Remark: While similar to the exponential mechanism (4), the discrete
exponential mechanism differs by dealing only with the identity query ($Q = I$),
by having a uniform measure ($\mu \equiv 1$) and by absorbing $\epsilon$ into $u$, in order to
allow consideration of $(\epsilon, \delta)$-differential privacy.

Even with this general set-up, we can still make improvements on workload 
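when checking for differential privacy. We begin by defining the following set
for each pair of neighbouring databases $\mathbf{d} \sim \mathbf{d}' \in D^n$:

\[ S_{\mathbf{d},\mathbf{d}'} = \{ \mathbf{d}^* \in D^n \mid \mathbb{P}(X_{\mathbf{d}} = \mathbf{d}^*) > \mathbb{P}(X_{\mathbf{d}'} = \mathbf{d}^*) \} \tag{8} \]
\[ \phantom{S_{\mathbf{d},\mathbf{d}'}} = \{ \mathbf{d}^* \in D^n \mid C_{\mathbf{d}}^{-1} e^{u(\mathbf{d}, \mathbf{d}^*)} > C_{\mathbf{d}'}^{-1} e^{u(\mathbf{d}', \mathbf{d}^*)} \}. \]

This set is a collection of the "worst-case" databases for $\mathbf{d}$ and $\mathbf{d}'$, and, as we
show in Theorem 2, is the only set of interest when checking for differential
privacy.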

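Theorem 2 (Sufficient sets) Let $\{X_{\mathbf{d}}\}$ be a discrete exponential mechanism
and fix $\mathbf{d} \sim \mathbf{d}' \in D^n$. If (5) holds on all $A \subseteq S_{\mathbf{d},\mathbf{d}'}$, then it will hold on all
$A \subseteq D^n$.

Proof: We fix $\mathbf{d} \sim \mathbf{d}' \in D^n$, let $A \subseteq D^n$ be given and assume (5)
holds on all subsets of $S_{\mathbf{d},\mathbf{d}'}$. By assumption, (5) holds on $A_0 = A \cap S_{\mathbf{d},\mathbf{d}'}$, hence
$\mathbb{P}(X_{\mathbf{d}} \in A_0) \leq e^{\epsilon} \mathbb{P}(X_{\mathbf{d}'} \in A_0) + \delta$. If $A_0 = A$ we are done, so assume $A \setminus A_0 \neq \emptyset$.

For each $\mathbf{d}^* \in A \setminus A_0$, $\mathbb{P}(X_{\mathbf{d}} = \mathbf{d}^*) \leq \mathbb{P}(X_{\mathbf{d}'} = \mathbf{d}^*)$. Pick one such $\mathbf{d}^*_0 \in A \setminus A_0$,
then

\begin{align*}
\mathbb{P}(X_{\mathbf{d}} \in A_0 \cup \{\mathbf{d}^*_0\}) &= \mathbb{P}(X_{\mathbf{d}} \in A_0) + \mathbb{P}(X_{\mathbf{d}} = \mathbf{d}^*_0) \\
&\leq e^{\epsilon} \mathbb{P}(X_{\mathbf{d}'} \in A_0) + \delta + \mathbb{P}(X_{\mathbf{d}} = \mathbf{d}^*_0) \\
&\leq e^{\epsilon} \mathbb{P}(X_{\mathbf{d}'} \in A_0) + \delta + \mathbb{P}(X_{\mathbf{d}'} = \mathbf{d}^*_0) \\
&\leq e^{\epsilon} \mathbb{P}(X_{\mathbf{d}'} \in A_0 \cup \{\mathbf{d}^*_0\}) + \delta.
\end{align*}

Hence, (5) holds on $A_1 = A_0 \cup \{\mathbf{d}^*_0\}$. We can similarly show that (5) holds
on $A_2 = A_1 \cup \{\mathbf{d}^*_1\}$ for any $\mathbf{d}^*_1 \in A \setminus A_1$. By repeating this process (picking
$\mathbf{d}^*_i \in A \setminus A_i$), we can show that (5) holds on $A_{i+1} = A_i \cup \{\mathbf{d}^*_i\}$ for each $i$.

Since $A$ is finite ($D^n$ is finite), this process will eventually terminate when
$A_i = A$, i.e. $i = |A \setminus A_0|$. Hence, (5) will hold on $A$ as required. □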

We now look at a simple example to demonstrate the impact of Theorem 2. 

Example 2 ($\ell_1$ norm) For this example, we consider a discrete exponential
mechanism where $D = \{0, 1, 2\}$, $n = 2$ and $u(\mathbf{d}, \mathbf{d}') = -\|\mathbf{d} - \mathbf{d}'\|_1$. In this
case, $|D^n| = 9$, and we are therefore required to check $2^9 - 2 = 510$ subsets (all
subsets of $D^n$ except $\emptyset$ and $D^n$ itself) for every pair of neighbouring databases
$\mathbf{d} \sim \mathbf{d}' \in D^n$, meaning a total of $510 \times 36 = 18{,}360$ checks.

Let $\mathbf{d} = (0, 0)^T$ and $\mathbf{d}' = (1, 0)^T$, then $S_{\mathbf{d},\mathbf{d}'} = \{(0, 0)^T, (0, 1)^T, (0, 2)^T\}$. Hence, for this
particular pair of neighbouring databases, it is sufficient to check that (5) holds
on just $2^3 - 1 = 7$ subsets (all subsets of $S_{\mathbf{d},\mathbf{d}'}$ except $\emptyset$).

If we choose $\mathbf{d} = (1, 0)^T$ and $\mathbf{d}' = (0, 0)^T$, then we get
$S_{\mathbf{d},\mathbf{d}'} = \{(1, 0)^T, (1, 1)^T, (1, 2)^T, (2, 0)^T, (2, 1)^T, (2, 2)^T\}$.

This gives a total of $2^6 - 1 = 63$ subsets to check.

By populating the entire set of databases, we can show that 21 pairs of neighbouring
databases require 7 subset checks, while the remaining 15 pairs require 63
checks. That leaves us with a total of 1,092 subset checks to verify differential
privacy, compared with 18,360 without the use of Theorem 2.
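
The set $S_{\mathbf{d},\mathbf{d}'}$ of (8) can be computed by direct enumeration for small examples such as this one. The sketch below is only an illustration under our own naming (`utility`, `pmf`, `sufficient_set`); it uses floating-point comparison, so outputs whose two probabilities are exactly equal in theory may be classified unreliably.

```python
import math
from itertools import product

D = (0, 1, 2)   # data set of Example 2
n = 2           # rows per database
DATABASES = list(product(D, repeat=n))


def utility(d, d_star):
    """u(d, d*) = -||d - d*||_1."""
    return -sum(abs(a - b) for a, b in zip(d, d_star))


def pmf(d):
    """Probability mass function of X_d from Definition 3."""
    weights = {d_star: math.exp(utility(d, d_star)) for d_star in DATABASES}
    c_d = sum(weights.values())          # normalisation constant C_d
    return {d_star: w / c_d for d_star, w in weights.items()}


def sufficient_set(d, d_prime):
    """S_{d,d'}: outputs strictly more likely under d than under d'."""
    p, q = pmf(d), pmf(d_prime)
    return {d_star for d_star in DATABASES if p[d_star] > q[d_star]}
```

For the pair $\mathbf{d} = (0, 0)^T$, $\mathbf{d}' = (1, 0)^T$ considered above, `sufficient_set((0, 0), (1, 0))` contains the three databases whose first row is 0.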

3.2. Response mechanism with fixed C d 

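By Theorem 2, we know that, for a discrete exponential mechanism, checking
that (5) holds on all subsets of $S_{\mathbf{d},\mathbf{d}'}$ is equivalent to checking all subsets of $D^n$.
However, if $C_{\mathbf{d}}$ is fixed for all $\mathbf{d} \in D^n$, we can partition $S_{\mathbf{d},\mathbf{d}'}$ to reduce our
workload further. Let us first consider the following set relating to $S_{\mathbf{d},\mathbf{d}'}$:

\[ \alpha = \{ u(\mathbf{d}, \mathbf{d}^*) - u(\mathbf{d}', \mathbf{d}^*) \mid \mathbf{d}^* \in S_{\mathbf{d},\mathbf{d}'} \}. \tag{9} \]

There are only finitely many such values in $\alpha$, since $S_{\mathbf{d},\mathbf{d}'}$ is finite. We label
these elements $\alpha_1, \alpha_2, \ldots, \alpha_s$, with $0 < \alpha_1 < \alpha_2 < \cdots < \alpha_s$.

We then partition $S_{\mathbf{d},\mathbf{d}'}$ into the collection of subsets $\{S^1_{\mathbf{d},\mathbf{d}'}, S^2_{\mathbf{d},\mathbf{d}'}, \ldots, S^s_{\mathbf{d},\mathbf{d}'}\}$
as follows:

\[ S^i_{\mathbf{d},\mathbf{d}'} = \{ \mathbf{d}^* \in S_{\mathbf{d},\mathbf{d}'} \mid u(\mathbf{d}, \mathbf{d}^*) - u(\mathbf{d}', \mathbf{d}^*) = \alpha_i \}. \tag{10} \]

Note that for each $i$ and for all $\mathbf{d}^* \in S^i_{\mathbf{d},\mathbf{d}'}$, writing $C$ for the common value of $C_{\mathbf{d}}$,

\begin{align*}
\mathbb{P}(X_{\mathbf{d}'} = \mathbf{d}^*) &= C^{-1} e^{u(\mathbf{d}', \mathbf{d}^*)} \\
&= C^{-1} e^{u(\mathbf{d}, \mathbf{d}^*) - \alpha_i} \\
&= e^{-\alpha_i} \, \mathbb{P}(X_{\mathbf{d}} = \mathbf{d}^*). \tag{11}
\end{align*}

We can now show that if (5) holds on these partitions, it will hold on all
subsets of each partition, i.e. the partitions are sufficient sets of themselves.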

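Theorem 3 Fix $\mathbf{d} \sim \mathbf{d}' \in D^n$ and let $\{X_{\mathbf{d}}\}$ be a discrete exponential mechanism
with $C_{\mathbf{d}} = C$ for all $\mathbf{d} \in D^n$. If (5) holds on $S^i_{\mathbf{d},\mathbf{d}'}$, then it will hold on all
$A \subseteq S^i_{\mathbf{d},\mathbf{d}'}$, for each $i$.

Proof: Fix $\mathbf{d} \sim \mathbf{d}' \in D^n$. We assume

\[ \mathbb{P}(X_{\mathbf{d}} \in S^i_{\mathbf{d},\mathbf{d}'}) \leq e^{\epsilon} \, \mathbb{P}(X_{\mathbf{d}'} \in S^i_{\mathbf{d},\mathbf{d}'}) + \delta, \]

and let $A \subseteq S^i_{\mathbf{d},\mathbf{d}'}$. By (11), $\mathbb{P}(X_{\mathbf{d}'} \in A) = e^{-\alpha_i} \mathbb{P}(X_{\mathbf{d}} \in A)$. Since this also
holds for the set $S^i_{\mathbf{d},\mathbf{d}'}$ itself, we have

\[ 1 \leq e^{\epsilon - \alpha_i} + \frac{\delta}{\mathbb{P}(X_{\mathbf{d}} \in S^i_{\mathbf{d},\mathbf{d}'})}. \]

Clearly $\mathbb{P}(X_{\mathbf{d}} \in A) \leq \mathbb{P}(X_{\mathbf{d}} \in S^i_{\mathbf{d},\mathbf{d}'})$, which gives

\[ 1 \leq e^{\epsilon - \alpha_i} + \frac{\delta}{\mathbb{P}(X_{\mathbf{d}} \in A)}, \]

or, rewriting,

\[ \mathbb{P}(X_{\mathbf{d}} \in A) \leq e^{\epsilon} \, \mathbb{P}(X_{\mathbf{d}'} \in A) + \delta. \]

Hence, (5) holds on all $A \subseteq S^i_{\mathbf{d},\mathbf{d}'}$. □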

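When $\delta = 0$, Theorem 3 tells us that these partitions $S^i_{\mathbf{d},\mathbf{d}'}$ of $S_{\mathbf{d},\mathbf{d}'}$ are all
we need to check to verify $\epsilon$-differential privacy, i.e. the sufficient sets are the
collection of subsets $\{S^1_{\mathbf{d},\mathbf{d}'}, \ldots, S^s_{\mathbf{d},\mathbf{d}'}\}$.

Corollary 1 (Sufficient sets with fixed $C_{\mathbf{d}}$) Let $\{X_{\mathbf{d}}\}$ be a discrete exponential
mechanism with $C_{\mathbf{d}} = C$ for all $\mathbf{d} \in D^n$, let $\delta = 0$ and fix $\mathbf{d} \sim \mathbf{d}' \in D^n$. If (5) holds
on $S^i_{\mathbf{d},\mathbf{d}'}$ for all $i$, then it will hold on all subsets of $D^n$.

Proof: Let $A \subseteq S_{\mathbf{d},\mathbf{d}'}$ and assume that (5) holds on $S^i_{\mathbf{d},\mathbf{d}'}$ for all $i$. Hence,
by Theorem 3, (5) holds on all subsets of $S^i_{\mathbf{d},\mathbf{d}'}$ for all $i$. As $\{S^i_{\mathbf{d},\mathbf{d}'}\}$ partitions
$S_{\mathbf{d},\mathbf{d}'}$, $A = \bigcup_i (A \cap S^i_{\mathbf{d},\mathbf{d}'})$, hence $\mathbb{P}(X_{\mathbf{d}} \in A \cap S^i_{\mathbf{d},\mathbf{d}'}) \leq e^{\epsilon} \mathbb{P}(X_{\mathbf{d}'} \in A \cap S^i_{\mathbf{d},\mathbf{d}'})$ for
all $i$.

Since $S^i_{\mathbf{d},\mathbf{d}'} \cap S^j_{\mathbf{d},\mathbf{d}'} = \emptyset$ when $i \neq j$,
$\mathbb{P}(X_{\mathbf{d}} \in \bigcup_i A \cap S^i_{\mathbf{d},\mathbf{d}'}) = \sum_i \mathbb{P}(X_{\mathbf{d}} \in A \cap S^i_{\mathbf{d},\mathbf{d}'})$, and so,

\begin{align*}
\mathbb{P}(X_{\mathbf{d}} \in A) &= \mathbb{P}\Big(X_{\mathbf{d}} \in \bigcup_i A \cap S^i_{\mathbf{d},\mathbf{d}'}\Big) \\
&= \sum_i \mathbb{P}(X_{\mathbf{d}} \in A \cap S^i_{\mathbf{d},\mathbf{d}'}) \\
&\leq e^{\epsilon} \sum_i \mathbb{P}(X_{\mathbf{d}'} \in A \cap S^i_{\mathbf{d},\mathbf{d}'}) \\
&= e^{\epsilon} \, \mathbb{P}(X_{\mathbf{d}'} \in A).
\end{align*}

Therefore (5) holds on all subsets of $S_{\mathbf{d},\mathbf{d}'}$, and by Theorem 2, it holds on all
subsets of $D^n$. □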

3.3. Discrete exponential mechanism with hamming distance 

The usefulness of Theorem 3 becomes particularly apparent when we restrict
our response mechanism to one derived from the Hamming distance. For the
remainder of this section our utility function $u$ is defined to be

\[ u(\mathbf{d}, \mathbf{d}') = -k \, h(\mathbf{d}, \mathbf{d}'), \tag{12} \]

where $k \geq 0$ is a privacy parameter and $h(\mathbf{d}, \mathbf{d}')$ is the Hamming distance between
$\mathbf{d}$ and $\mathbf{d}'$ (the number of elements on which they differ, see (1)). Note
that for this set-up, the normalisation constant $C_{\mathbf{d}} = C$ is fixed for all $\mathbf{d} \in D^n$.

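Proposition 1 For a discrete exponential mechanism satisfying (12), the
normalisation constant of Definition 3 is

\[ C_{\mathbf{d}} = C = \left(1 + \frac{m}{e^k}\right)^{n}, \tag{13} \]

so that $\mathbb{P}(X_{\mathbf{d}} = \mathbf{d}') = \left(1 + \frac{m}{e^k}\right)^{-n} e^{-k h(\mathbf{d}, \mathbf{d}')}$.

Proof: Let $\mathbf{d} \in D^n$ be given. There are $\binom{n}{i} m^i$ databases at Hamming
distance $i$ from $\mathbf{d}$. Then,

\begin{align*}
\sum_{\mathbf{d}' \in D^n} \mathbb{P}(X_{\mathbf{d}} = \mathbf{d}') &= \sum_{\mathbf{d}' \in D^n} C_{\mathbf{d}}^{-1} e^{-k h(\mathbf{d}, \mathbf{d}')} \\
&= C_{\mathbf{d}}^{-1} e^{0} + C_{\mathbf{d}}^{-1} n m e^{-k} + C_{\mathbf{d}}^{-1} \binom{n}{2} m^2 e^{-2k} + \cdots + C_{\mathbf{d}}^{-1} m^n e^{-nk} \\
&= C_{\mathbf{d}}^{-1} \sum_{i=0}^{n} \binom{n}{i} m^i e^{-ik} \\
&= C_{\mathbf{d}}^{-1} \left(1 + \frac{m}{e^k}\right)^{n}.
\end{align*}

For the probability mass function of $X_{\mathbf{d}}$ to sum to 1, we need

\[ C_{\mathbf{d}}^{-1} \left(1 + \frac{m}{e^k}\right)^{n} = 1. \]

Rearranging this completes the proof. □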
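
The closed form for the normalisation constant can be checked numerically on small instances. The sketch below is an illustration only: the function names are ours, $D$ is taken to be $\{0, \ldots, m\}$, and $C_{\mathbf{d}}$ is taken to be the sum of the weights $e^{-k h(\mathbf{d}, \mathbf{d}')}$ as in Definition 3, so that the probability mass function is $e^{-k h(\mathbf{d}, \mathbf{d}')} / C_{\mathbf{d}}$.

```python
import math
from itertools import product


def hamming(d, d_prime):
    """Number of rows on which the two databases differ."""
    return sum(a != b for a, b in zip(d, d_prime))


def normaliser_brute_force(n, m, k):
    """Sum of exp(-k h(d, d')) over all d', with d fixed to the all-zero database."""
    d = (0,) * n
    return sum(math.exp(-k * hamming(d, d_prime))
               for d_prime in product(range(m + 1), repeat=n))


def normaliser_closed_form(n, m, k):
    """Closed form (1 + m e^{-k})^n."""
    return (1 + m * math.exp(-k)) ** n
```

By symmetry the value does not depend on which database $\mathbf{d}$ is fixed, which is why the sketch only evaluates the all-zero database.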

In this case, the sufficient sets for the problem condense down to a single 
set. 


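Corollary 2 (Sufficient sets for Hamming distance) Consider a discrete
exponential mechanism satisfying (12). If (5) holds on $S_{\mathbf{d},\mathbf{d}'}$, then it will hold on
all $A \subseteq D^n$.

Proof: First note that, for all $\mathbf{d} \sim \mathbf{d}' \in D^n$,

\[ h(\mathbf{d}', \mathbf{d}^*) - h(\mathbf{d}, \mathbf{d}^*) \in \{-1, 0, 1\}, \]

hence for all $\mathbf{d}^* \in S_{\mathbf{d},\mathbf{d}'}$, $h(\mathbf{d}', \mathbf{d}^*) - h(\mathbf{d}, \mathbf{d}^*) = 1$. As $u$ is linear in $h$, the set $\alpha$
reduces to a singleton set,

\begin{align*}
\alpha &= \{ -k \, h(\mathbf{d}, \mathbf{d}^*) + k \, h(\mathbf{d}', \mathbf{d}^*) \mid \mathbf{d}^* \in S_{\mathbf{d},\mathbf{d}'} \} \\
&= \{k\},
\end{align*}

and hence $S_{\mathbf{d},\mathbf{d}'} = S^1_{\mathbf{d},\mathbf{d}'}$.

We can then conclude that if (5) holds on $S_{\mathbf{d},\mathbf{d}'}$, it must hold on all subsets
of $S_{\mathbf{d},\mathbf{d}'}$ (by Theorem 3) and also on all subsets of $D^n$ (by Theorem 2). □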

We have established that, for the discrete exponential mechanism satisfying
(12) and for a given neighbouring pair of databases $\mathbf{d} \sim \mathbf{d}'$, to check for
differential privacy on all possible subsets of the database space $D^n$, we need only
check a single set $S_{\mathbf{d},\mathbf{d}'}$.

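It is now a relatively simple task to establish conditions on the response
mechanism for differential privacy. For the following theorem, we assume that
$\delta < 1$ (note that all mechanisms are trivially $(\epsilon, 1)$-differentially private).

Theorem 4 (Condition for differential privacy) Let $\delta < 1$. A discrete
exponential mechanism $\{X_{\mathbf{d}}\}$ satisfying (12) is $(\epsilon, \delta)$-differentially private if and
only if

\[ e^k \leq \frac{e^{\epsilon} + m\delta}{1 - \delta}. \tag{14} \]

Proof: Fix $\mathbf{d} \sim \mathbf{d}' \in D^n$. By Corollary 2 we need only satisfy (5) on $S_{\mathbf{d},\mathbf{d}'}$
for it to hold on all $A \subseteq D^n$. Hence we need

\[ \mathbb{P}(X_{\mathbf{d}} \in S_{\mathbf{d},\mathbf{d}'}) \leq e^{\epsilon} \, \mathbb{P}(X_{\mathbf{d}'} \in S_{\mathbf{d},\mathbf{d}'}) + \delta. \]

By definition, $\mathbb{P}(X_{\mathbf{d}'} \in S_{\mathbf{d},\mathbf{d}'}) = e^{-k} \, \mathbb{P}(X_{\mathbf{d}} \in S_{\mathbf{d},\mathbf{d}'})$, therefore the mechanism
will be differentially private if and only if

\[ 1 \leq e^{\epsilon - k} + \frac{\delta}{\mathbb{P}(X_{\mathbf{d}} \in S_{\mathbf{d},\mathbf{d}'})}. \tag{15} \]

Now consider $\mathbb{P}(X_{\mathbf{d}} \in S_{\mathbf{d},\mathbf{d}'})$. By definition, each $\mathbf{d}^* \in S_{\mathbf{d},\mathbf{d}'}$ lies a Hamming
distance of $h(\mathbf{d}, \mathbf{d}^*) + 1$ from $\mathbf{d}'$. Therefore, the number of $\mathbf{d}^* \in S_{\mathbf{d},\mathbf{d}'}$ where $h(\mathbf{d}, \mathbf{d}^*) = c$
is $\binom{n-1}{c} m^c$ ($c$ elements must be changed, but not the element on which $\mathbf{d}$ and
$\mathbf{d}'$ differ, to ensure $h(\mathbf{d}, \mathbf{d}^*) + 1 = h(\mathbf{d}', \mathbf{d}^*)$). Hence, with $C = \left(1 + \frac{m}{e^k}\right)^n$ the
normalisation constant of Definition 3,

\begin{align*}
\mathbb{P}(X_{\mathbf{d}} \in S_{\mathbf{d},\mathbf{d}'}) &= C^{-1} \sum_{i=0}^{n-1} \binom{n-1}{i} m^i e^{-ik} \\
&= C^{-1} \left(1 + \frac{m}{e^k}\right)^{n-1} \\
&= \left(1 + \frac{m}{e^k}\right)^{-1}.
\end{align*}

Substituting this result into (15) gives

\[ 1 \leq e^{\epsilon - k} + \delta \left(1 + \frac{m}{e^k}\right), \]

and solving for $e^k$ completes the proof. □

Remark: For $(\epsilon, 0)$-differential privacy, we require $k \leq \epsilon$.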
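
Condition (14), $e^k \leq (e^{\epsilon} + m\delta)/(1 - \delta)$, can be compared against an exhaustive check of (5) on very small instances. The following is a sketch under our own assumptions: the names `max_k` and `is_dp_brute_force` are hypothetical, $D$ is taken to be $\{0, \ldots, m\}$, and a small tolerance absorbs floating-point error. The enumeration is exponential in $|D^n|$ and is only meant for tiny $n$ and $m$.

```python
import math
from itertools import combinations, product


def max_k(eps, delta, m):
    """Largest k allowed by (14): e^k <= (e^eps + m*delta) / (1 - delta)."""
    return math.log((math.exp(eps) + m * delta) / (1 - delta))


def is_dp_brute_force(n, m, k, eps, delta, tol=1e-12):
    """Check (5) on every non-empty subset for every ordered neighbouring pair."""
    dbs = list(product(range(m + 1), repeat=n))
    norm = (1 + m * math.exp(-k)) ** n   # normalisation constant for u = -k*h

    def h(a, b):
        return sum(x != y for x, y in zip(a, b))

    # pmf[d][j] = P(X_d = dbs[j])
    pmf = {d: [math.exp(-k * h(d, t)) / norm for t in dbs] for d in dbs}
    subsets = [s for r in range(1, len(dbs) + 1)
               for s in combinations(range(len(dbs)), r)]
    for d, d2 in product(dbs, repeat=2):
        if h(d, d2) != 1:
            continue
        for s in subsets:
            lhs = sum(pmf[d][j] for j in s)
            rhs = math.exp(eps) * sum(pmf[d2][j] for j in s) + delta
            if lhs > rhs + tol:
                return False
    return True
```

For example, with $n = 2$, $m = 2$, $\epsilon = 0.5$ and $\delta = 0.05$, one would expect the exhaustive check to pass for $k$ just below `max_k(0.5, 0.05, 2)` and to fail for $k$ slightly above it.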

Discussion: If we convert our discrete exponential mechanism back into the
form of the exponential mechanism, McSherry and Talwar [14] tell us that the
mechanism satisfies $2\epsilon$-differential privacy at worst. However, we have shown
in Theorem 4 that the mechanism satisfies $\epsilon$-differential privacy, and that this
condition is tight (necessary and sufficient, hence we can do no better). This
improves on the looser bound in McSherry and Talwar's proof, which underestimates
the differential privacy achieved by a factor of two. Their mechanism is
also limited to $\epsilon$-differential privacy only ($\delta = 0$), whereas we can account for
$(\epsilon, \delta)$-differential privacy ($\delta > 0$).

We define the error of a mechanism, $\mathcal{E}$, to be the largest mean Hamming
distance between the input and sanitised databases:

\[ \mathcal{E} = \max_{\mathbf{d} \in D^n} \mathbb{E}[h(X_{\mathbf{d}}, \mathbf{d})]. \tag{16} \]

Using the differential privacy constraints established in Theorem 4, we can now
determine error bounds of the mechanism.


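Theorem 5 (Error) The error, $\mathcal{E}$, of a discrete exponential mechanism satisfying
(12) and which is $(\epsilon, \delta)$-differentially private, satisfies

\[ \frac{1 - \delta}{1 + \frac{e^{\epsilon}}{m}} \leq \frac{\mathcal{E}}{n} \leq \frac{m}{m + 1}. \tag{17} \]

Proof: Let $\mathbf{d} \in D^n$. Then $\mathbb{P}(X_{\mathbf{d}} = \mathbf{d}') = C^{-1} e^{-k h(\mathbf{d}, \mathbf{d}')}$, with $C = \left(1 + \frac{m}{e^k}\right)^n$, and
$|\{\mathbf{d}' : h(\mathbf{d}, \mathbf{d}') = c\}| = \binom{n}{c} m^c$ for all $c \in \{0, 1, \ldots, n\}$. Hence,

\begin{align*}
\mathbb{E}[h(X_{\mathbf{d}}, \mathbf{d})] &= C^{-1} \sum_{i=0}^{n} i \binom{n}{i} m^i e^{-ik} \\
&= C^{-1} n \frac{m}{e^k} \sum_{i=0}^{n-1} \binom{n-1}{i} \left(\frac{m}{e^k}\right)^{i} \\
&= C^{-1} n \frac{m}{e^k} \left(1 + \frac{m}{e^k}\right)^{n-1} \\
&= n \frac{m}{e^k} \left(1 + \frac{m}{e^k}\right)^{-1} \\
&= \frac{n}{1 + \frac{e^k}{m}}.
\end{align*}

On average, the proportion of entries that will change is therefore

\[ \frac{\mathcal{E}}{n} = \frac{\max_{\mathbf{d} \in D^n} \mathbb{E}[h(X_{\mathbf{d}}, \mathbf{d})]}{n} = \frac{1}{1 + \frac{e^k}{m}}. \]

Since the response mechanism is differentially private, Theorem 4 tells us
that $e^k \leq \frac{e^{\epsilon} + m\delta}{1 - \delta}$, and $k \geq 0$ by definition, hence,

\[ \frac{1 - \delta}{1 + \frac{e^{\epsilon}}{m}} \leq \frac{\mathcal{E}}{n} \leq \frac{m}{m + 1}. \]

□

Remark: This lower bound is tight and can be achieved by setting
$k = \ln\left(\frac{e^{\epsilon} + m\delta}{1 - \delta}\right)$ (see Theorem 4).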
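
The per-row error $\mathcal{E}/n = 1/(1 + e^k/m)$ and the bounds in (17) are straightforward to evaluate. The sketch below, with function names of our own choosing and $D$ taken to be $\{0, \ldots, m\}$, also recomputes the expected Hamming distance by enumeration so that the closed form can be compared on small instances.

```python
import math
from itertools import product


def error_per_row(k, m):
    """E/n = 1 / (1 + e^k / m) for the Hamming-distance mechanism."""
    return 1.0 / (1.0 + math.exp(k) / m)


def expected_hamming_brute_force(n, m, k):
    """E[h(X_d, d)] by enumeration, with d fixed to the all-zero database."""
    d = (0,) * n
    norm = (1 + m * math.exp(-k)) ** n
    total = 0.0
    for d_prime in product(range(m + 1), repeat=n):
        dist = sum(a != b for a, b in zip(d, d_prime))
        total += dist * math.exp(-k * dist) / norm
    return total


def error_bounds(eps, delta, m):
    """Lower and upper bounds on E/n from (17)."""
    return (1 - delta) / (1 + math.exp(eps) / m), m / (m + 1)
```

Setting $k$ to the largest value permitted by Theorem 4 should reproduce the lower bound, while $k = 0$ (uniform output) should reproduce the upper bound.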

4. Product Sanitisation 

The results of Section 3 give a clear framework on how to create differentially
private mechanisms for any type of discrete data, particularly categorical data
using the Hamming distance, and to obtain tight (necessary and sufficient)
conditions for differential privacy. However, this mechanism requires the creation of
a separate probability distribution for each of the $(m + 1)^n$ unique databases.

In this section, we present a simple method for realising the discrete
exponential mechanism.

4.1. Response mechanism

A product sanitisation mechanism is one where the database is sanitised 
row-by-row. The following is the definition of such a mechanism, following the 
same notation as in [10]. 

Definition 4 (Product sanitisation mechanism) Given a 1-dimensional
response mechanism for the identity query, $\{X_d : \Omega \to D \mid d \in D\}$ (the parent
mechanism), the product sanitisation response mechanism is defined to be the
set of measurable mappings

\[ \{ X_{\mathbf{d}} : \Omega \to D^n \mid \mathbf{d} \in D^n \} \]

given by

\[ X_{\mathbf{d}} = (X^1_{\mathbf{d}}, \ldots, X^n_{\mathbf{d}}), \tag{18} \]

where the $X^i_{\mathbf{d}}$ are independent and each $X^i_{\mathbf{d}}$ has the same distribution as $X_{d_i}$,
for all $\mathbf{d} \in D^n$, $i \in \{1, \ldots, n\}$.

In this setting, each row of the sanitised database is represented by an 
independent random variable, as if the database represents n 1-dimensional 
databases, each sanitised independently. Realising this framework therefore 
only requires the creation of m + 1 probability distributions which are then 
copied to create distributions on each database. 

We now recall Theorem 5 of [10], which states that differential privacy on 
a product sanitisation mechanism is guaranteed when its parent mechanism is 
differentially private. 

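Theorem 6 Consider a family $\{X_d \mid d \in D\}$ of measurable mappings and
assume that

\[ \mathbb{P}(X_d \in A) \leq e^{\epsilon} \, \mathbb{P}(X_{d'} \in A) + \delta, \]

for all $d, d' \in D$ and all $A \subseteq D$. Let $\{X_{\mathbf{d}} \mid \mathbf{d} \in D^n\}$ be a product sanitisation
mechanism. Then

\[ \mathbb{P}(X_{\mathbf{d}} \in A) \leq e^{\epsilon} \, \mathbb{P}(X_{\mathbf{d}'} \in A) + \delta, \]

for all $\mathbf{d} \sim \mathbf{d}' \in D^n$ and all $A \subseteq D^n$.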

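The converse of this result was proven in Lemma 3 of [10]. However, we
are able to take a simpler approach to showing this converse by exploiting the
finiteness of $D$.

Corollary 3 (Parent mechanism) A product sanitisation mechanism,
$\{X_{\mathbf{d}} \mid \mathbf{d} \in D^n\}$, is $(\epsilon, \delta)$-differentially private if and only if its parent mechanism
$\{X_d \mid d \in D\}$ is $(\epsilon, \delta)$-differentially private.

Proof: "$\Leftarrow$": Theorem 6.

"$\Rightarrow$": Let $A \subseteq D$ and $d \neq d' \in D$ be given. Define $A' = A \times D^{n-1} \subseteq D^n$
and $\mathbf{d}, \mathbf{d}' \in D^n$ such that $d_1 = d$, $d'_1 = d'$ and $d_i = d'_i$ for all $i \in \{2, \ldots, n\}$,
hence $\mathbf{d} \sim \mathbf{d}'$. Since $\{X_{\mathbf{d}}\}$ is differentially private by assumption,

\[ \mathbb{P}(X_{\mathbf{d}} \in A') \leq e^{\epsilon} \, \mathbb{P}(X_{\mathbf{d}'} \in A') + \delta. \]

However, since $\{X_{\mathbf{d}}\}$ is a product sanitisation mechanism,

\begin{align*}
\mathbb{P}(X_{\mathbf{d}} \in A') &= \mathbb{P}(X^1_{\mathbf{d}} \in A) \times \prod_{i=2}^{n} \mathbb{P}(X^i_{\mathbf{d}} \in D) \\
&= \mathbb{P}(X^1_{\mathbf{d}} \in A) \\
&= \mathbb{P}(X_{d_1} \in A),
\end{align*}

and similarly for $\mathbf{d}'$. Hence,

\[ \mathbb{P}(X_d \in A) \leq e^{\epsilon} \, \mathbb{P}(X_{d'} \in A) + \delta, \]

for all $d \neq d' \in D$ and $A \subseteq D$, as required. □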

Hence, for a product sanitisation mechanism to be differentially private, we 
need only show that its parent mechanism is differentially private. 

For the remainder of this section, given $0 \leq p \leq \frac{1}{m+1}$, the parent mechanism
$\{X_d \mid d \in D\}$ of the product sanitisation is defined such that

\[ \mathbb{P}(X_d = d) = 1 - pm, \qquad \mathbb{P}(X_d = d') = p, \tag{19} \]

for every $d' \in D \setminus \{d\}$.

Remark: $p = \frac{1}{m+1}$ represents the case of releasing uniform noise (i.e. no
information), while decreasing $p$ reduces the error of the mechanism.

Remark: By requiring $p \leq \frac{1}{m+1}$ we get $\mathbb{P}(X_d = d) \geq \mathbb{P}(X_d = d')$ for every
$d' \in D$. Therefore, the mechanism is at least as likely to return the correct
answer as any one incorrect answer, an entirely reasonable assumption.

Example 3 (Categorical data II) Using the same set-up as in Example 1,
one such permissible $p$ would be $p = 0.1$, since $0 \leq 0.1 \leq \frac{1}{5}$. Then, if $d$ were
to be the value 'Television', and $d'$ the value 'Cars', $\mathbb{P}(X_d = d) = 0.6$, and
$\mathbb{P}(X_d = d') = 0.1$.

We then sanitise the $n$-row database $\mathbf{d}$ in the same way, working through the
database one row at a time.
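
The row-by-row procedure of Example 3 can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names `parent_pmf` and `sanitise` are hypothetical, and Python's `random` module is assumed to be an acceptable source of randomness for demonstration purposes only.

```python
import random

HOBBIES = ["Sports", "Cars", "Television", "Computer games", "Reading"]


def parent_pmf(d, data_set, p):
    """Distribution of X_d in (19): 1 - p*m on d, p on every other value."""
    m = len(data_set) - 1
    if not 0 <= p <= 1 / (m + 1):
        raise ValueError("p must lie in [0, 1/(m+1)]")
    return {value: (1 - p * m if value == d else p) for value in data_set}


def sanitise(database, data_set, p, rng=None):
    """Sanitise each row independently using the parent mechanism."""
    rng = rng or random.Random()
    out = []
    for d in database:
        pmf = parent_pmf(d, data_set, p)
        values = list(pmf)
        out.append(rng.choices(values, weights=[pmf[v] for v in values], k=1)[0])
    return out
```

Only the $m + 1$ parent distributions are ever constructed, one per possible row value, instead of one distribution for each of the $(m + 1)^n$ databases.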

In the first main result of this section, we show that the product sanitisation
mechanism and the discrete exponential mechanism are equivalent, despite
being constructed in different ways (this is subject to the discrete exponential
mechanism satisfying (12) and the product sanitisation mechanism satisfying
(19)).

Theorem 7 (Equivalence) Let $\{X_{\mathbf{d}} \mid \mathbf{d} \in D^n\}$ be a discrete exponential
mechanism satisfying (12), and $\{Y_{\mathbf{d}} \mid \mathbf{d} \in D^n\}$ be a product sanitisation response
mechanism, whose parent mechanism satisfies (19). Then the probability
mass functions of $X_{\mathbf{d}}$ and $Y_{\mathbf{d}}$ are identical, for every $\mathbf{d} \in D^n$, when

\[ e^k = \frac{1}{p} - m. \tag{20} \]

Proof: Let $\mathbf{d} \in D^n$, then

\begin{align*}
\mathbb{P}(X_{\mathbf{d}} = \mathbf{d}') &= \left(1 + \frac{m}{e^k}\right)^{-n} e^{-k h(\mathbf{d}, \mathbf{d}')}, \\
\mathbb{P}(Y_{\mathbf{d}} = \mathbf{d}') &= (1 - pm)^{n} \left(\frac{p}{1 - pm}\right)^{h(\mathbf{d}, \mathbf{d}')},
\end{align*}

for all $\mathbf{d}' \in D^n$.

For the two mechanisms to be equivalent, we need $\frac{p}{1 - pm} = e^{-k}$, or, rewriting,

\[ \frac{1 - pm}{p} = e^k. \]

We also need $1 - pm = \left(1 + \frac{m}{e^k}\right)^{-1}$, which can be rewritten as $\frac{1}{1 - pm} = 1 + \frac{m}{e^k}$, or

\[ \frac{1 - pm}{p} = e^k. \]

Hence, the mechanisms $\{X_{\mathbf{d}}\}$ and $\{Y_{\mathbf{d}}\}$ are equivalent when $e^k = \frac{1}{p} - m$ or
$p = \frac{1}{e^k + m}$. □
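
The equivalence in Theorem 7 can be checked numerically by comparing the two probability mass functions term by term on a small instance. The sketch below uses our own function names and takes $D = \{0, \ldots, m\}$; it relies only on the relation $p = 1/(e^k + m)$ from (20).

```python
import math
from itertools import product


def exponential_pmf(d, d_prime, m, k):
    """Discrete exponential mechanism with u = -k * Hamming distance."""
    n = len(d)
    h = sum(a != b for a, b in zip(d, d_prime))
    return (1 + m * math.exp(-k)) ** (-n) * math.exp(-k * h)


def product_pmf(d, d_prime, m, p):
    """Product sanitisation: a row is kept w.p. 1 - p*m, or mapped to a given other value w.p. p."""
    prob = 1.0
    for a, b in zip(d, d_prime):
        prob *= (1 - p * m) if a == b else p
    return prob


def p_from_k(k, m):
    """Relation (20) solved for p: p = 1 / (e^k + m)."""
    return 1.0 / (math.exp(k) + m)


def pmfs_agree(d, m, k):
    """Compare both mass functions on every output database d'."""
    p = p_from_k(k, m)
    return all(math.isclose(exponential_pmf(d, t, m, k), product_pmf(d, t, m, p))
               for t in product(range(m + 1), repeat=len(d)))
```

Note that $k = 0$ corresponds to $p = 1/(m + 1)$, the uniform-noise case mentioned in the remark following (19).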


4.2. Alternative proofs of Theorems 4 and 5

We have already established that the discrete exponential mechanism satis¬ 
fying (12) and the product sanitisation mechanism satisfying (19) are equivalent, 
meaning the results of Theorems 4 and 5 also apply to the product sanitisation 
mechanism in this particular set-up. In this sub-section, we provide alternative
methods of proof for these theorems that make use of the specific product
sanitisation of the mechanism, for the additional insight provided. 

The proof of the following theorem first appeared in [10], but is included 
here for completeness. 


Theorem 8 (Condition for differential privacy) A product sanitisation mech¬ 
anism, whose parent mechanism satisfies (19), is (e, 5)-differentially private if 
and only if 


P > 


1-5 
e e + m 


( 21 ) 


Proof: By Corollary 3, we need only be concerned with proving differential
privacy on $\{X_d \mid d \in D\}$ for it to hold on $\{\mathbf{X}_{\mathbf{d}} \mid \mathbf{d} \in D^n\}$.

“$\Rightarrow$”: Assume $\{X_d\}$ is differentially private and let $d \ne d' \in D$ be given.
Applying the definition of differential privacy to the singleton set $A = \{d\}$ gives

\[ 1 - pm \le e^\epsilon p + \delta. \]

Rearranging this gives (21).

“$\Leftarrow$”: Assume (21) holds and let $d \ne d' \in D$ and $A \subseteq D$ be given. There
are four cases to consider on $A$:

1. $d, d' \notin A$: Then $\mathbb{P}(X_d \in A) = \mathbb{P}(X_{d'} \in A) = p|A|$ and differential privacy
holds trivially.

2. $d, d' \in A$: Then $\mathbb{P}(X_d \in A) = \mathbb{P}(X_{d'} \in A) = p(|A| - 1) + 1 - pm =
p(|A| - m - 1) + 1$ and differential privacy holds trivially.

3. $d' \in A$, $d \notin A$: Then $\mathbb{P}(X_d \in A) = p|A|$ and $\mathbb{P}(X_{d'} \in A) = p(|A| - 1) + 1 - pm$.
Since $p \le \frac{1}{m+1}$ by definition, $p \le 1 - pm$, so $\mathbb{P}(X_d \in A) \le \mathbb{P}(X_{d'} \in A)$ and
differential privacy holds.

4. $d \in A$, $d' \notin A$: Then

\[ \mathbb{P}(X_d \in A) = p(|A| - m - 1) + 1, \qquad \mathbb{P}(X_{d'} \in A) = p|A|. \]

From (21), we have

\[ \begin{aligned}
1 - pm &\le e^\epsilon p + \delta \\
&= e^\epsilon (p|A| - p|A| + p) + \delta \\
&\le e^\epsilon p|A| - p(|A| - 1) + \delta,
\end{aligned} \]

since $|A| \ge 1$ ($d \in A$ by hypothesis) and $e^\epsilon \ge 1$. Rewriting the above,

\[ p(|A| - m - 1) + 1 \le e^\epsilon (p|A|) + \delta, \]

hence $\mathbb{P}(X_d \in A) \le e^\epsilon \mathbb{P}(X_{d'} \in A) + \delta$ and differential privacy holds.

□

Remark: For $(\epsilon, 0)$-differential privacy, we require $p \ge \frac{1}{e^\epsilon + m}$.

Remark: By definition, $p$ is bounded from above by $\frac{1}{m+1}$, so for differential
privacy we require

\[ \frac{1 - \delta}{e^\epsilon + m} \le p \le \frac{1}{m + 1}. \]
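For small data sets, the condition in (21) can be confirmed by brute force. The sketch below is an illustration only, under stated assumptions: $D = \{0, \dots, m\}$, a parent mechanism of the form (19), and hypothetical helper names `parent_row` and `is_dp`. It enumerates every non-empty subset $A \subseteq D$ and every ordered pair $d \ne d'$, and tests the defining inequality directly.

```python
import itertools
import math


def parent_row(d, p, m):
    """Output distribution of X_d over D = {0, ..., m} for a mechanism of form (19)."""
    return [1 - p * m if j == d else p for j in range(m + 1)]


def is_dp(p, m, eps, delta, tol=1e-12):
    """Brute-force check of (eps, delta)-differential privacy of the parent mechanism."""
    domain = range(m + 1)
    for size in range(1, m + 2):
        for subset in itertools.combinations(domain, size):
            for d, d2 in itertools.permutations(domain, 2):
                prob_d = sum(parent_row(d, p, m)[j] for j in subset)
                prob_d2 = sum(parent_row(d2, p, m)[j] for j in subset)
                # tol absorbs floating-point rounding at the boundary case.
                if prob_d > math.exp(eps) * prob_d2 + delta + tol:
                    return False
    return True


# Illustrative parameters (assumptions, not taken from the paper).
m, eps, delta = 3, 0.5, 0.05
p_min = (1 - delta) / (math.exp(eps) + m)  # lower bound from (21)
p_max = 1 / (m + 1)                        # upper bound on p by definition

print(is_dp(p_min, m, eps, delta))        # boundary value of p
print(is_dp(p_max, m, eps, delta))        # uniform output distribution
print(is_dp(0.9 * p_min, m, eps, delta))  # p below the bound in (21)
```

The enumeration is exponential in $m$, so this is only practical as a sanity check on small examples; the theorem makes it unnecessary in general.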

We now look at an alternative proof of Theorem 5 which established error 
bounds on the mechanism. 

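Theorem 9 (Error) The error, $\mathcal{E}$, of a product sanitisation mechanism, whose
parent mechanism satisfies (19) and is $(\epsilon, \delta)$-differentially private, satisfies

\[ \frac{1 - \delta}{1 + \frac{e^\epsilon}{m}} \le \frac{\mathcal{E}}{n} \le \frac{m}{m + 1}. \tag{22} \]

Proof: Let $\mathbf{d} = (d_1, \dots, d_n) \in D^n$. Then,

\[ \begin{aligned}
\mathcal{E} &= \max_{\mathbf{d} \in D^n} \mathbb{E}\left[h(\mathbf{X}_{\mathbf{d}}, \mathbf{d})\right] \\
&= \max_{\mathbf{d} \in D^n} \mathbb{E}\left[\sum_{i=1}^{n} h(X_{d_i}, d_i)\right] \\
&= \max_{\mathbf{d} \in D^n} \sum_{i=1}^{n} \mathbb{E}\left[h(X_{d_i}, d_i)\right] \\
&= \max_{\mathbf{d} \in D^n} \sum_{i=1}^{n} \mathbb{P}(X_{d_i} \ne d_i) \\
&= n \max_{d \in D} \left(1 - \mathbb{P}(X_d = d)\right) \\
&= npm.
\end{aligned} \]

On average, the proportion of entries that will change is therefore

\[ \frac{\mathcal{E}}{n} = pm. \]

As the mechanism satisfies $(\epsilon, \delta)$-differential privacy, $p \ge \frac{1 - \delta}{e^\epsilon + m}$ by
Theorem 8, and since $p \le \frac{1}{m+1}$ by definition,

\[ \frac{1 - \delta}{1 + \frac{e^\epsilon}{m}} \le \frac{\mathcal{E}}{n} \le \frac{m}{m + 1}. \]

□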

Remark: As with the discrete exponential mechanism, this bound is tight
and can be achieved by setting $p$ equal to the lower bound established in
Theorem 8.
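The identity $\mathcal{E} = npm$ and the tightness of the lower bound can be illustrated by exact enumeration on a small example. This is a hedged sketch rather than anything from the paper: it assumes $D = \{0, \dots, m\}$, a parent mechanism of the form (19), and illustrative parameter values; `expected_error` is a hypothetical helper name.

```python
import itertools
import math


def expected_error(d, p, m):
    """Exact expected Hamming distance between the sanitised output and the input d."""
    n = len(d)
    total = 0.0
    for out in itertools.product(range(m + 1), repeat=n):
        h = sum(a != b for a, b in zip(out, d))
        # Each entry is kept with probability 1 - p*m and moved to any
        # one specific other value with probability p.
        total += h * (1 - p * m) ** (n - h) * p ** h
    return total


# Illustrative parameters (assumptions, not taken from the paper).
m, n, eps, delta = 2, 3, 1.0, 0.1
p = (1 - delta) / (math.exp(eps) + m)  # lower bound on p from Theorem 8

err = expected_error((0, 1, 2), p, m)
lower = (1 - delta) / (1 + math.exp(eps) / m)
upper = m / (m + 1)

print(err, n * p * m, lower, err / n, upper)
```

Because every entry is treated identically under (19), the expected error does not depend on the particular input chosen, so a single input suffices for this check.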


4.3. Optimal mechanism

We now look at the second main result of this section. Using the same
error definition as before, (16), we show how to construct the optimal
$(\epsilon, \delta)$-differentially private mechanism which produces the minimum error. For the
purpose of this subsection, we assume the following labelling of elements in the
data set:

\[ D = \{d_1, d_2, \dots, d_{m+1}\}. \tag{23} \]

Definition 5 (Solution Matrix) The parent mechanism $\{X_d\}$ of a product
sanitisation mechanism can be defined by a stochastic matrix

\[ P_{(\epsilon,\delta)} \in [0, 1]^{(m+1) \times (m+1)}, \]

where

\[ \mathbb{P}(X_{d_i} = d_j) = \left[P_{(\epsilon,\delta)}\right]_{ij}. \tag{24} \]

We refer to $P_{(\epsilon,\delta)}$ as a product sanitisation solution matrix if the mechanism it
defines is $(\epsilon, \delta)$-differentially private.

Remark: A parent mechanism which satisfies (19) is represented by the
following solution matrix:

\[ \left[P_{(\epsilon,\delta)}\right]_{ij} = \begin{cases} 1 - mp & \text{if } i = j, \\ p & \text{if } i \ne j. \end{cases} \tag{25} \]
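As an illustration of how a matrix of the form (25) defines a mechanism in practice, the following minimal Python sketch builds the matrix and applies the parent mechanism independently to each entry of a record. It assumes the labels $d_1, \dots, d_{m+1}$ are encoded as the integers $0, \dots, m$; the names `solution_matrix` and `sanitise` are hypothetical and do not come from the paper.

```python
import random


def solution_matrix(m, p):
    """Matrix of form (25): 1 - m*p on the diagonal, p everywhere else."""
    return [[1 - m * p if i == j else p for j in range(m + 1)] for i in range(m + 1)]


def sanitise(data, matrix, rng):
    """Apply the parent mechanism independently to every entry of `data`.

    Row `d` of `matrix` is the output distribution for the true value `d`.
    """
    labels = list(range(len(matrix)))
    return [rng.choices(labels, weights=matrix[d], k=1)[0] for d in data]


# Illustrative values (assumptions): m + 1 = 4 categories, p = 0.2.
P = solution_matrix(3, 0.2)
rng = random.Random(0)  # seeded only to make the illustration repeatable
record = [0, 1, 2, 3, 0]
released = sanitise(record, P, rng)
print(released)
```

Sampling entry by entry is what makes the overall mechanism a product mechanism: the output for each entry depends only on that entry's true value.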


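Theorem 10 (Optimality) Let $\{X_d\}$ be a parent mechanism which satisfies
(19), with $p = \frac{1 - \delta}{e^\epsilon + m}$. The error, $\mathcal{E} = \max_{\mathbf{d} \in D^n} \mathbb{E}[h(\mathbf{X}_{\mathbf{d}}, \mathbf{d})]$, produced by its
product sanitisation mechanism $\{\mathbf{X}_{\mathbf{d}}\}$ is the minimum over all $(\epsilon, \delta)$-differentially
private product sanitisation mechanisms.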

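Proof: Let $A = P_{(\epsilon,\delta)}$, satisfying (25) with $p = \frac{1 - \delta}{e^\epsilon + m}$, be the solution
matrix of $\{X_d\}$, i.e. $\mathbb{P}(X_{d_i} = d_j) = a_{ij}$. Note that this mechanism has the
property that $a_{jj} = e^\epsilon a_{ij} + \delta$, since $a_{jj} = \frac{e^\epsilon + m\delta}{e^\epsilon + m}$ and $a_{ij} = \frac{1 - \delta}{e^\epsilon + m}$, for all $i \ne j$.

The error of its product sanitisation mechanism $\{\mathbf{X}_{\mathbf{d}}\}$, using the same method
as in the proof of Theorem 9, is

\[ \begin{aligned}
\max_{\mathbf{d} \in D^n} \mathbb{E}\left[h(\mathbf{X}_{\mathbf{d}}, \mathbf{d})\right]
&= n \max_{d \in D} \left(1 - \mathbb{P}(X_d = d)\right) \\
&= n \left(\max_i (1 - a_{ii})\right) \\
&= n \left(1 - \min_i a_{ii}\right) \\
&= n (1 - a_{ii}),
\end{aligned} \]

for all $i$, since $a_{ii} = a_{jj} = 1 - mp$ for all $i$ and $j$.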

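Let $B$ be a solution matrix defining the parent mechanism $\{Y_d\}$, with a
corresponding product sanitisation mechanism $\{\mathbf{Y}_{\mathbf{d}}\}$, where $B \ne A$. Since $B$ is
a solution matrix, it is stochastic ($\sum_j b_{ij} = 1$ for all $i$) and the parent mechanism
it defines is $(\epsilon, \delta)$-differentially private ($\mathbb{P}(Y_d \in A) \le e^\epsilon \mathbb{P}(Y_{d'} \in A) + \delta$, for all
$d, d' \in D$ and $A \subseteq D$).

Since $A \ne B$ and $A$, $B$ are stochastic, there exists at least one pair $(i^*, j^*)$,
where $b_{i^*j^*} < a_{i^*j^*}$. There are two cases to consider:

1. $i^* = j^*$: The error of $\{\mathbf{Y}_{\mathbf{d}}\}$ is then

\[ \begin{aligned}
\max_{\mathbf{d} \in D^n} \mathbb{E}\left[h(\mathbf{Y}_{\mathbf{d}}, \mathbf{d})\right] &= n \left(1 - \min_i b_{ii}\right) \\
&\ge n (1 - b_{j^*j^*}) \\
&> n (1 - a_{j^*j^*}) \\
&= \max_{\mathbf{d} \in D^n} \mathbb{E}\left[h(\mathbf{X}_{\mathbf{d}}, \mathbf{d})\right].
\end{aligned} \]

2. $i^* \ne j^*$: As noted previously, $a_{j^*j^*} = e^\epsilon a_{i^*j^*} + \delta$, and since $\{Y_d\}$ is $(\epsilon, \delta)$-differentially
private, $\mathbb{P}(Y_{d_{j^*}} = d_{j^*}) \le e^\epsilon \mathbb{P}(Y_{d_{i^*}} = d_{j^*}) + \delta$, or alternatively,
$b_{j^*j^*} \le e^\epsilon b_{i^*j^*} + \delta$. Therefore,

\[ \begin{aligned}
b_{j^*j^*} &\le e^\epsilon b_{i^*j^*} + \delta \\
&< e^\epsilon a_{i^*j^*} + \delta \\
&= a_{j^*j^*}.
\end{aligned} \]

Hence, $b_{j^*j^*} < a_{j^*j^*}$, and case 1 applies.

We therefore conclude that

\[ \max_{\mathbf{d} \in D^n} \mathbb{E}\left[h(\mathbf{Y}_{\mathbf{d}}, \mathbf{d})\right] > \max_{\mathbf{d} \in D^n} \mathbb{E}\left[h(\mathbf{X}_{\mathbf{d}}, \mathbf{d})\right], \]

and that $\{\mathbf{X}_{\mathbf{d}}\}$ is the product sanitisation mechanism which produces the optimal
error. □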

Theorem 10 gives a simple method to construct the most accurate
$(\epsilon, \delta)$-differentially private product sanitisation mechanism for discrete data:
no other mechanism in this class has a smaller maximal expected error.
It follows from Theorem 7 that we now also know the optimal discrete
exponential mechanism that satisfies (12).
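The optimality result can be illustrated by comparing the optimal matrix with another valid solution matrix. The sketch below is an assumption-laden example, not the paper's construction: it takes $D = \{0, \dots, m\}$, checks differential privacy by brute force over all subsets, and builds a competitor `B` by mixing the optimal matrix with a mechanism that always outputs the first label. The helper names and parameter values are hypothetical.

```python
import itertools
import math


def is_solution_matrix(matrix, eps, delta, tol=1e-12):
    """Check that `matrix` is stochastic and (eps, delta)-differentially private."""
    size = len(matrix)
    if any(abs(sum(row) - 1) > tol or min(row) < -tol for row in matrix):
        return False
    for r in range(1, size + 1):
        for subset in itertools.combinations(range(size), r):
            probs = [sum(row[j] for j in subset) for row in matrix]
            # Worst pair of inputs: largest against smallest probability.
            if max(probs) > math.exp(eps) * min(probs) + delta + tol:
                return False
    return True


def per_entry_error(matrix):
    """Maximal expected error divided by n, i.e. 1 - min_i matrix[i][i]."""
    return 1 - min(matrix[i][i] for i in range(len(matrix)))


# Illustrative parameters (assumptions, not taken from the paper).
m, eps, delta, t = 2, 1.0, 0.0, 0.3
p = (1 - delta) / (math.exp(eps) + m)

# Optimal matrix of form (25) with p at the lower bound of Theorem 8.
A = [[1 - m * p if i == j else p for j in range(m + 1)] for i in range(m + 1)]
# Competitor: with probability t, ignore the input and output label 0.
B = [[(1 - t) * A[i][j] + (t if j == 0 else 0.0) for j in range(m + 1)]
     for i in range(m + 1)]
# The identity matrix releases the data unchanged and is not private.
identity = [[1.0 if i == j else 0.0 for j in range(m + 1)] for i in range(m + 1)]

print(is_solution_matrix(A, eps, delta), per_entry_error(A))
print(is_solution_matrix(B, eps, delta), per_entry_error(B))
print(is_solution_matrix(identity, eps, delta))
```

A single competitor does not prove optimality, of course; it only shows the inequality of Theorem 10 at work on one asymmetric solution matrix.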

5. Conclusion 

We study mechanisms for $(\epsilon, \delta)$-differential privacy on finite datasets that
are an adaptation for discrete data of the exponential mechanism introduced
by McSherry and Talwar. By deriving sufficient sets for differential privacy we
obtain necessary and sufficient conditions for differential privacy, a tight lower
bound on the maximal expected error of a discrete mechanism and a characterisation
of the optimal mechanism which minimises the maximal expected error
within the class of mechanisms considered.